Abstract

The lexical acquisition system presented in this pa- 
per incrementally updates linguistic properties of un- 
known words inferred from their surrounding con- 
text by parsing sentences with an HPSG grammar 
for German. We employ a gradual, information- 
based concept of "unknownness" providing a uni- 
form treatment for the range of completely known to 
maximally unknown lexical entries. "Unknown" in- 
formation is viewed as revisable information, which 
is either generalizable or specializable. Updating 
takes place after parsing, which only requires a mod- 
ified lexical lookup. Revisable pieces of informa- 
tion are identified by grammar-specified declarations 
which provide access paths into the parse feature 
structure. The updating mechanism revises the cor- 
responding places in the lexical feature structures iff 
the context actually provides new information. For 
revising generalizable information, type union is re- 
quired. A worked-out example demonstrates the in- 
ferential capacity of our implemented system. 

1 Introduction 

It is a remarkable fact that humans can often un- 
derstand sentences containing unknown words, in- 
fer their grammatical properties and incrementally 
refine hypotheses about these words when encountering
later instances. In contrast, many current NLP
systems still presuppose a complete lexicon. Notable
exceptions include Zernik (1989), Erbach (1990),
Hastings & Lytinen (1994). See Zernik for an
introduction to the general issues involved.

This paper describes an HPSG-based system
which can incrementally learn and refine properties
of unknown words after parsing individual sentences.
It focusses on extracting linguistic properties,
as compared to e.g. general concept learning
(Hahn, Klenner & Schnattinger 1996). Unlike Erbach
(1990), however, it is not confined to simple
morpho-syntactic information but can also handle
selectional restrictions, semantic types and argument
structure. Finally, while statistical approaches
like Brent (1991) can gather e.g. valence information
from large corpora, we are more interested in
full grammatical processing of individual sentences
to maximally exploit each context.

The following three goals serve to structure 
our model. It should i) incorporate a gradual, 
information-based conceptualization of "unknown- 
ness". Words are not unknown as a whole, but 
may contain unknown, i.e. revisable pieces of infor- 
mation. Consequently, even known words can un- 
dergo revision to e.g. acquire new senses. This view 
replaces the binary distinction between open and 
closed class words. It should ii) maximally exploit 
the rich representations and modelling conventions 
of HPSG and associated formalisms, with essen- 
tially the same grammar and lexicon as compared 
to closed-lexicon approaches. This is important both 
to facilitate reuse of existing grammars and to en- 
able meaningful feedback for linguistic theorizing. 
Finally, it should iii) possess domain-independent in- 
ference and lexicon-updating capabilities. The gram- 
mar writer must be able to fully declare which pieces 
of information are open to revision. 

The system was implemented using MicroCUF, 
a simplified version of the CUF typed unification 
formalism (Dörre & Dorna 1993) that we implemented
in SICStus Prolog. It shares both the feature
logic and the definite clause extensions with its big 
brother, but substitutes a closed-world type system 
for CUF's open-world regime. A feature of our type 
system implementation that will be significant later 
on is that type information in internal feature
structures (FSs) can be easily updated.

The HPSG grammar developed with MicroCUF 
models a fragment of German. Since our focus is on 
the lexicon, the range of syntactic variation treated 
is currently limited to simplex sentences with canon- 
ical word order. We have incorporated some recent 
developments of HPSG, esp. the revisions of Pollard
& Sag (1994, ch. 9), Manning & Sag (1995)'s
proposal for an independent level of argument structure
and Bouma (1997)'s use of argument structure
to eliminate procedural lexical rules in favour of re- 
lational constraints. Our elaborate ontology of se- 
mantic types - useful for non-trivial acquisition of 
selectional restrictions and nominal sorts - was de- 
rived from a systematic corpus study of a biological 
domain (Knodel 1980, 154-188). The grammar also
covers all valence classes encountered in the corpus. 
As for the lexicon format, we currently list full forms 
only. Clearly, a morphology component would sup- 
ply more contextual information from known affixes 
but would still require the processing of unknown 
stems. 

2 Incremental Lexical Acquisition 

When compared to a previous instance, a new sen- 
tential context can supply either identical, more spe- 
cial, more general, or even conflicting information 
along a given dimension. Example pairs illustrating 
the latter three relationships are given under (1)-(3)
(words assumed to be unknown in bold face).

(1) a. Im Axon tritt ein Ruhepotential auf.
       'a rest potential occurs in the axon'

    b. Das Potential wandert über das Axon.
       'the potential travels along the axon'

(2) a. Das Ohr reagiert auf akustische Reize.
       'the ear reacts to acoustic stimuli'

    b. Ein Sinnesorgan reagiert auf Reize.
       'a sense organ reacts to stimuli'

(3) a. Die Nase ist für Gerüche sensibel.
       'the nose is sensitive to smells'

    b. Die sensible Nase reagiert auf Gerüche.
       'the sensitive nose reacts to smells'

In contrast to (1a), which provides the information
that the gender of Axon is not feminine (via im), the
context in (1b) is more specialized, assigning neuter
gender (via das). Conversely, (2b) differs from (2a)
in providing a more general selectional restriction for
the subject of reagiert, since sense organs include
ears as a subtype. Finally, the adjective sensibel is
used predicatively in (3a), but attributively in (3b).
The usage types must be formally disjoint, because 
some German adjectives allow for just one usage 
(ehemalig 'former, attr.', schuld 'guilty, pred.'). 

On the basis of contrasts like those in (1)-(3) it
makes sense to statically assign revisable information
to one of two classes, namely specializable or
generalizable.¹ Apart from the specializable kinds
'semantic type of nouns' and 'gender', the inflec- 
tional class of nouns is another candidate (given a 
morphological component). Generalizable kinds of 
information include 'selectional restrictions of verbs 
and adjectives', 'predicative vs attributive usage of 
adjectives' as well as 'case and form of PP argu- 
ments' and 'valence class of verbs'. Note that spe- 
cializable and generalizable information can cooccur 
in a given lexical entry. A particular kind of informa- 
tion may also figure in both classes, as e.g. seman- 
tic type of nouns and selectional restrictions of verbs 
are both drawn from the same semantic ontology. Yet 
the former must be invariantly specialized - indepen- 
dent of the order in which contexts are processed -, 
whereas selectional restrictions on NP complements 
should only become more general with further con- 
texts. 

2.1 Representation 

We require all revisable or updateable information to 
be expressible as formal types.² As relational clauses
can be defined to map types to FSs, this is not much 
of a restriction in practice. Figure 1 shows a relevant
fragment.

    non_fem      fem      stimulus      sense_organ
     /   \                 /    \          /    \
   masc  neut          sound   smell    nose    ear

Figure 1: Excerpt from type hierarchy

1 The different behaviour underlying this classification has
previously been noted by e.g. Erbach (1990) and Hastings &
Lytinen (1994) but received either no implementational status or
no systematic association with arbitrary kinds of information.

2 In HPSG types are sometimes also referred to as sorts.

Whereas the combination of specializable information
translates into simple type unification
(e.g. non_fem ∧ neut = neut), combining
generalizable information requires type union (e.g.
pred ∨ attr = prd). The latter might pose problems
for type systems requiring the explicit definition of 
all possible unions, corresponding to least common 
supertypes. However, type union is easy for (Mi- 
cro)CUF and similar systems which allow for arbi- 
trary boolean combinations of types. Generalizable 
information exhibits another peculiarity: we need a 
disjoint auxiliary type u_g to correctly mark the initial
unknown information state.³ This is because
'content' types like prd, pred, attr are to be inter- 
preted as recording what contextual information was 
encountered in the past. Thus, using any of these to 
prespecify the initial value - either as the side-effect 
of a feature appropriateness declaration (e.g. prd) or 
through grammar-controlled specification (e.g. pred, 
attr) - would be wrong (cf. prd[initial] ∨ attr = prd,
but u_g[initial] ∨ attr = u_g ∨ attr).

Generalizable information evokes another question:
can we simply have types like those in fig. 1
within HPSG signs and do in-place type union, just 
like type unification? The answer is no, for essen- 
tially two reasons. First, we still want to rule out 
ungrammatical constructions through (type) unifica- 
tion failure of coindexed values, so that generalizable 
types cannot always be combined by nonfailing type 
union (e.g. *der sensible Geruch 'the sensitive smell' 
must be ruled out via sense_organ ∧ smell = ⊥).
We would ideally like to order all type unifications 
pertaining to a value before all unions, but this vi- 
olates the order independence of constraint solv- 
ing. Secondly, we already know that a given infor- 
mational token can simultaneously be generalizable 
and specializable, e.g. by being coindexed through 
HPSG's valence principle. However, simultaneous 
in-place union and unification is contradictory. 

To avoid these problems and keep the declarative 
monotonic setting, we employ two independent fea- 
tures gen and ctxt. ctxt is the repository of contex- 
tually unified information, where conflicts result in 
ungrammaticality. gen holds generalizable informa- 
tion. Since all gen values contain u_g as a type dis- 
junct, they are always unifiable and thus not restric- 
tive during the parse. To nevertheless get correct gen 
values we perform type union after parsing, i.e. dur- 
ing lexicon update. We will see below how this works 
out. 
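
A minimal sketch of why the gen/ctxt split keeps parsing restrictive only where it should be, again assuming the hypothetical leaf-set encoding of types. The feature structures are reduced to a pair of values, and the entries for sensibel, Geruch and Nase are invented for illustration rather than taken from the grammar.

```python
# Hypothetical leaf-set encoding of a few semantic types.
U_G = frozenset({"u_g"})
SENSE_ORGAN = frozenset({"nose", "ear"})
SMELL = frozenset({"smell"})


def unify(a, b):
    """Intersection of leaf sets; None stands for unification failure."""
    return (a & b) or None


def unify_sign(x, y):
    """Unify two {gen, ctxt} pairs feature by feature."""
    gen = unify(x["gen"], y["gen"])
    ctxt = unify(x["ctxt"], y["ctxt"])
    if gen is None or ctxt is None:
        return None
    return {"gen": gen, "ctxt": ctxt}


# Assumed values: the slot selected by 'sensibel' and two candidate nouns.
adjective_slot = {"gen": U_G | SENSE_ORGAN, "ctxt": SENSE_ORGAN}
geruch = {"gen": U_G, "ctxt": SMELL}
nase = {"gen": U_G, "ctxt": frozenset({"nose"})}

# gen values always share u_g, so they never cause failure on their own.
print(unify(adjective_slot["gen"], geruch["gen"]) == U_G)
# ctxt stays restrictive: sense_organ and smell are incompatible.
print(unify_sign(adjective_slot, geruch) is None)
# A compatible noun unifies, with ctxt narrowed to nose.
print(unify_sign(adjective_slot, nase))
```

The gen value that survives unification is not yet the desired union; in this sketch, as in the text, the union is deferred to lexicon update.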



3 Actually, the situation is more symmetrical, as we need a
dual type u_s to correctly mark "unknown" specializable information.
This prevents incorrect updating of known information.
However, u_s is unnecessary for the examples presented below.



The last representational issue is how to identify 
revisable information in (substructures of) the parse 
FS. For this purpose the grammar defines revisability 
clauses like the following: 



(4) a. generalizable([1], [2]) :=

       [ synsem|loc|cat|head  [ adj
                                prd  [ gen   [1]
                                       ctxt  [2] ] ] ]

    b. specializable([1]) :=

       [ synsem|loc  [ cat|head       noun
                       cont|ind|gend  [1] ] ]
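
How a revisability clause provides an access path into a word FS can be sketched as follows. This is a simplification under stated assumptions: feature structures are modelled as nested Python dictionaries, and clause matching is reduced to an equality test on a guard value, whereas the system described here matches by unification. `follow`, `match` and `SPECIALIZABLE_GEND` are hypothetical names.

```python
def follow(fs, path):
    """Walk a feature path; return None if some feature is missing."""
    for feat in path:
        if not isinstance(fs, dict) or feat not in fs:
            return None
        fs = fs[feat]
    return fs


# Hypothetical rendering of clause (4b): a guard on the head type
# plus the path to the revisable gender value.
SPECIALIZABLE_GEND = {
    "guard": (("synsem", "loc", "cat", "head"), "noun"),
    "target": ("synsem", "loc", "cont", "ind", "gend"),
}


def match(clause, word_fs):
    """Return the revisable value if the clause applies, else None."""
    guard_path, guard_type = clause["guard"]
    if follow(word_fs, guard_path) != guard_type:
        return None
    return follow(word_fs, clause["target"])


nase = {"synsem": {"loc": {"cat": {"head": "noun"},
                           "cont": {"ind": {"gend": "fem"}}}}}
ist = {"synsem": {"loc": {"cat": {"head": "verb"}}}}

print(match(SPECIALIZABLE_GEND, nase))  # the clause applies to the noun
print(match(SPECIALIZABLE_GEND, ist))   # and is filtered out for the verb
```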



2.2 Processing 

The first step in processing sentences with unknown 
or revisable words consists of conventional parsing. 
Any HPSG-compatible parser may be used, subject 
to the obvious requirement that lexical lookup must 
not fail if a word's phonology is unknown. A canon- 
ical entry for such unknown words is defined as the 
disjunction of maximally underspecified generic lex- 
ical entries for nouns, adjectives and verbs. 
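
The modified lexical lookup can be sketched in a few lines. The sketch assumes a lexicon keyed by full forms and represents the generic entries as bare dictionaries; real generic entries would be maximally underspecified HPSG signs. `GENERIC_UNKNOWN` and `lookup` are hypothetical names.

```python
# Hypothetical generic entries; real ones would be underspecified HPSG signs.
GENERIC_UNKNOWN = (
    {"head": "noun"},
    {"head": "adj"},
    {"head": "verb"},
)


def lookup(lexicon, phon):
    """Return the entries for phon, or the generic disjunction if unknown."""
    if phon in lexicon:
        return lexicon[phon]
    # Copy each generic disjunct and attach the unseen phonology.
    return [dict(entry, phon=phon) for entry in GENERIC_UNKNOWN]


lexicon = {"die": [{"head": "det", "phon": "die"}]}
known = lookup(lexicon, "die")
unknown = lookup(lexicon, "perzipiert")
print([e["head"] for e in unknown])
```

Note that lookup alone leaves the lexicon untouched in this sketch; a new entry is only created when an update is actually performed.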

The actual updating of lexical entries consists of 
four major steps. Step 1 projects the parse FS derived 
from the whole sentence onto all participating word 
tokens. This results in word FSs which are contextu- 
ally enriched (as compared to their original lexicon 
state) and disambiguated (choosing the compatible 
disjunct per parse solution if the entry was disjunc- 
tive). It then filters the set of word FSs by unification 
with the right-hand side of revisability clauses like in 
(4). The output of step 1 is a list of update candidates
for those words which were unifiable. 

Step 2 determines concrete update values for each 
word: for each matching generalizable clause we 
take the type union of the gen value of the old, lexical 
state of the word (LexGen) with the ctxt value of its 
parse projection (Ctxt): TU = LexGen ∪ Ctxt. For
each matching specializable(Spec) clause we take 
the parse value Spec. 

Step 3 checks whether updating would make a dif- 
ference w.r.t. the original lexical entry of each word. 
The condition to be met by generalizable information 
is that TU ⊋ LexGen, for specializable information
we similarly require Spec ⊊ LexSpec.

In step 4 the lexical entries of words surviving step
3 are actually modified. We retract the old lexical entry,
revise the entry and re-assert it. For words never
encountered before, revision must obviously be preceded
by making a copy of the generic unknown entry,
but with the new word's phonology. Revision itself
is the destructive modification of type information
according to the values determined in step 2,
at the places in a word FS pointed to by the revis- 
ability clauses. This is easy in MicroCUF, as types 
are implemented via the attributed variable mecha- 
nism of SICStus Prolog, which allows us to substi- 
tute the type in-place. In comparison, general updat- 
ing of Prolog-encoded FSs would typically require 
the traversal of large structures and be dangerous if 
structure-sharing between substituted and unaffected 
parts existed. Also note that we currently assume 
DNF-expanded entries, so that updates work on the 
contextually selected disjunct. This can be motivated 
by the advantages of working with presolved struc- 
tures at run-time, avoiding description-level opera- 
tions and incremental grammar recompilation. 
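
Steps 2 to 4 can be summarized in a minimal sketch. It assumes the hypothetical leaf-set encoding of types, reads the step 3 conditions as strict set inclusion, and replaces a flat dictionary entry instead of destructively modifying an attributed variable. All function and feature names (`generalize`, `specialize`, `revise`, `subj_gen`) are illustrative.

```python
U_G = frozenset({"u_g"})
GENDER = frozenset({"masc", "neut", "fem"})


def generalize(lex_gen, ctxt):
    """Steps 2-3, generalizable: union, kept only if strictly larger."""
    tu = lex_gen | ctxt
    return tu if tu > lex_gen else None


def specialize(lex_spec, spec):
    """Steps 2-3, specializable: parse value, kept only if strictly smaller."""
    return spec if spec < lex_spec else None


def revise(lexicon, phon, feature, new_value):
    """Step 4: store the revised entry; do nothing if there is no news."""
    if new_value is None:
        return False
    lexicon[phon] = dict(lexicon[phon], **{feature: new_value})
    return True


lexicon = {"Nase": {"gend": GENDER}, "perzipiert": {"subj_gen": U_G}}
fem = frozenset({"fem"})
ear = frozenset({"ear"})
changed = [
    revise(lexicon, "Nase", "gend", specialize(lexicon["Nase"]["gend"], fem)),
    # Seeing the same evidence again makes no difference.
    revise(lexicon, "Nase", "gend", specialize(lexicon["Nase"]["gend"], fem)),
    revise(lexicon, "perzipiert", "subj_gen",
           generalize(lexicon["perzipiert"]["subj_gen"], ear)),
]
print(changed)
```

The second call shows the purpose of step 3: an entry is only touched when the context contributes information it does not already contain.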

2.3 A Worked-Out Example 

We will illustrate how incremental lexical revision 
works by going through the examples under (5)-(7).

(5) Die Nase ist ein Sinnesorgan.
    'the nose is a sense organ'

(6) Das Ohr perzipiert.
    'the ear perceives'

(7) Eine verschnupfte Nase perzipiert den Gestank.
    'a bunged up nose perceives the stench'

The relevant substructures corresponding to the lex- 
ical FSs of the unknown noun and verb involved 
are depicted in fig. 2. The leading feature paths
synsem|loc|cont for Nase and synsem|loc|cat|arg-st 
for perzipiert have been omitted. 

After parsing (5) the gender of the unknown noun
Nase is instantiated to fem by agreement with the
determiner die. As the specializable clause (4b)
matches and the gend parse value differs from its
lexical value gender, gender is updated to fem. Furthermore,
the object's semantic type has percolated
to the subject Nase. Since the object's sense_organ
type differs from generic initial nom_sem, Nase's ctxt
value is updated as well. In place of the still nonex- 
isting entry for perzipiert, we have displayed the rel- 
evant part of the generic unknown verb entry. 

Having parsed (6) the system then knows that
perzipiert can be used intransitively with a nomi- 
native subject referring to ears. Formally, an HPSG 
mapping principle was successful in mediating be- 
tween surface subject and complement lists and the 
argument list. Argument list instantiations are themselves
related to corresponding types by a further
mapping. On the basis of this type classification of
argument structure patterns, the parse derived the
ctxt value npnom. Since gen values are generalizable,
this new value is unioned with the old lexical
gen value. Note that ctxt is properly unaffected.
The first (subject) element on the args list itself is
targeted by another revisability clause. This has the
side-effect of further instantiating the underspecified
lexical FS. Since selectional restrictions on nominal
subjects must become more general with new contextual
evidence, the union of ear and the old value
u_g is indeed appropriate.

[Figure 2: Updates on lexical FSs]

Sentence (7) first of all provides more specific evidence
about the semantic type of partially known
Nase by way of attributive modification through ver- 
schnupfte. The system detects this through the differ- 
ence between lexical ctxt value sense_organ and the 
parse value nose, so that the entry is specialized ac- 
cordingly. Since the subject's synsem value is coin- 
dexed with the first args element, [ctxt nose] simulta- 
neously appears in the FS of perzipiert. However, the 
revisability clause matching there is of class generalizable,
so union takes place, yielding ear ∨ nose =
sense_organ (w.r.t. the simplified ontology of fig.
1 used in this paper). An analogous match with the
second element of args identifies the necessary up- 
date to be the unioning-in of smell, the semantic type 
of Gestank. Finally, the system has learned that an 
accusative NP object can cooccur with perzipiert, so 
the argument structure type of gen receives another 
update through union with npnom_npacc.
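
The semantic-type updates of the worked example can be replayed in a short sketch, assuming the hypothetical leaf-set encoding and the simplified ontology in which sense organs comprise only nose and ear and stimuli only sound and smell. Gender and argument structure updates are left out.

```python
# Hypothetical leaf-set encoding of the simplified ontology.
NOM_SEM = frozenset({"sound", "smell", "nose", "ear"})
SENSE_ORGAN = frozenset({"nose", "ear"})
U_G = frozenset({"u_g"})


def leaf(name):
    """A most specific type, denoted by a singleton set."""
    return frozenset({name})


# Nase: the semantic type is specializable, so evidence is intersected.
nase = NOM_SEM
for evidence in (SENSE_ORGAN, leaf("nose")):   # from (5), then (7)
    nase = nase & evidence

# perzipiert: selectional restrictions are generalizable, so evidence
# is unioned onto the initial u_g value.
subject = U_G
for evidence in (leaf("ear"), leaf("nose")):   # from (6), then (7)
    subject = subject | evidence
obj = U_G | leaf("smell")                      # from (7)

print(sorted(nase), sorted(subject - U_G), sorted(obj))
```

Apart from the u_g marker, the subject restriction ends up covering exactly the leaves of sense_organ, which mirrors the identity ear ∨ nose = sense_organ stated above.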



3 Discussion 

The incremental lexical acquisition approach de- 
scribed above attains the goals stated earlier. It re- 
alizes a gradual, information-based conceptualiza- 
tion of unknownness by providing updateable formal 
types - classified as either generalizable or special- 
izable - together with grammar-defined revisability 
clauses. It maximally exploits standard HPSG rep- 
resentations, requiring moderate rearrangements in 
grammars at best while keeping with the standard 
assumptions of typed unification formalisms. One 
noteworthy demand, however, is the need for a type 
union operation. Parsing is conventional modulo a 
modified lexical lookup. The actual lexical revision 
is done in a domain-independent postprocessing step 
guided by the revisability clauses. 

Of course there are areas requiring further consid- 
eration. In contrast to humans, who seem to leap to 
conclusions based on incomplete evidence, our ap- 
proach employs a conservative form of generaliza- 
tion, taking the disjunction of actually observed val- 
ues only. While this has the advantage of not leading 
to overgeneralization, the requirement of having to 
encounter all subtypes in order to infer their com- 
mon supertype is not realistic (sparse-data problem). 
In (7) sense_organ as the semantic type of the first
argument of perzipiert is only acquired because the
simplified hierarchy in fig. 1 has nose and ear as its
only subtypes. Here the work of Li & Abe (1995)
who use the MDL principle to generalize over the 
slots of observed case frames might prove fruitful. 

An important question is how to manage
alternative parses and their update hypotheses. In
Das Aktionspotential erreicht den Dendriten 'the 
action potential reaches the dendrite(s)', Dendriten 
is ambiguous between acc.sg. and dat.pl., giving 
rise to two valence hypotheses npnom_npacc and
npnom_npdat for erreicht. Details remain to be
worked out on how to delay the choice between such 
alternative hypotheses until further contexts provide 
enough information. 

Another topic concerns the treatment of 'cooccurrence
restrictions'. In fig. 2 the system has independently
generalized over the selectional restrictions
for subject and object, yet there are clear cases
where this overgenerates (e.g. *Das Ohr perzipiert 
den Gestank 'the ear perceives the stench'). An idea 
worth exploring is to have a partial, extensible list of 
type cooccurrences, which is traversed by a recursive 
principle at parse time. 

A more general issue is the apparent antagonism
between the desire to have both sharp grammatical
predictions and continuing openness to contextual
revision. If after parsing (7) we transfer the fact that
smells are acceptable objects to perzipiert into the re- 
stricting ctxt feature, a later usage with an object of 
type sound fails. The opposite case concerns newly 
acquired specializable values. If in a later context 
these are used to update a gen value, the result may 
be too general. It is a topic of future research when 
to consider information certain and when to make re- 
visable information restrictive. 